Information Extraction using Non-consecutive Word Sequences
Abstract
We address an important deficiency in existing machine learning approaches for information extraction from natural language texts. Existing techniques for information extraction employ rules that exploit properties of consecutive word sequences. We argue that sequences of non-consecutive words capturing long range contextual correlations are vital features for information extraction from natural language text. We propose an efficient method that extends the Apriori algorithm to mine frequently occurring non-consecutive word sequences from a given corpus. We also perform a simplistic aggregation of feature information across multiple mentions of an entity in a document to avoid independent classification of the multiple occurrences of the entity. Experiments on some standard data sets show substantial improvements over previously reported
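The mining step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes support is counted per sentence, places no limit on gap width, and uses a plain level-wise join in the Apriori style. The function names `is_subsequence` and `mine_frequent_sequences` and the sample sentences are hypothetical.

```python
from itertools import product


def is_subsequence(pattern, sentence):
    """Return True if the words of `pattern` occur in `sentence` in order, gaps allowed."""
    remaining = iter(sentence)
    # `word in remaining` advances the iterator, so order is enforced.
    return all(word in remaining for word in pattern)


def mine_frequent_sequences(sentences, min_support, max_len=3):
    """Level-wise mining of frequent, possibly non-consecutive, word sequences.

    Assumption: the support of a pattern is the number of sentences containing it.
    """
    def support(pattern):
        return sum(is_subsequence(pattern, s) for s in sentences)

    # Level 1: frequent single words.
    vocab = sorted({w for s in sentences for w in s})
    level = {}
    for w in vocab:
        count = support((w,))
        if count >= min_support:
            level[(w,)] = count
    frequent = dict(level)

    for _ in range(2, max_len + 1):
        # Join step: extend p with the last word of q when p minus its first
        # word equals q minus its last word. Both parts of any frequent
        # pattern are themselves frequent, so no frequent pattern is missed.
        candidates = {p + q[-1:] for p, q in product(level, repeat=2)
                      if p[1:] == q[:-1]}
        level = {}
        for cand in candidates:
            count = support(cand)
            if count >= min_support:
                level[cand] = count
        if not level:
            break
        frequent.update(level)
    return frequent


sentences = [
    "the meeting will be held in Boston".split(),
    "the seminar will be held tomorrow in Chicago".split(),
    "a talk was held in Denver".split(),
]
patterns = mine_frequent_sequences(sentences, min_support=3)
```

In this toy corpus the pair `("held", "in")` is found in all three sentences even though "tomorrow" separates the two words in the second one, which is the kind of long-range pattern a consecutive n-gram would miss.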
Similar resources
Integrating Word Sequences and Dependency Structures for Chemical-Disease Relation Extraction
Understanding chemical-disease relations (CDR) from biomedical literature is important for biomedical research and chemical discovery. This paper uses a k-max pooling convolutional neural network (CNN) to exploit word sequences and dependency structures for CDR extraction. Furthermore, an effective weighted context method is proposed to capture semantic information of word sequences. Our system...
A Parallel Multikey Quicksort Algorithm for Mining Multiword Units
In the context of word associations, multiword units (sequences of words that co-occur more often than expected by chance) are frequently used in everyday language, usually to precisely express ideas and concepts that cannot be compressed into a single word. For instance, [Bill of Rights], [swimming pool], [as well as], [in order to], [to comply with] or [to put forward] are multiword units. As...
Extraction-Based Text Summarization Using Fuzzy Analysis
Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...
Keyphrase Extraction and Grouping Based on Association Rules
Keyphrases are important in capturing the content of a document and thus useful for many natural language processing tasks such as Information Retrieval, Document Classification, and Text Summarization. Keyphrase extraction aims to identify multi-word sequences from a collection of documents that more or less correspond to keyphrases. In this paper, we propose a new method for keyphrase extract...
Automatic Extraction of Word Sequence Correspondences in Parallel Corpora
This paper proposes a method of finding correspondences of arbitrary length word sequences in aligned parallel corpora of Japanese and English. Translation candidates of word sequences are evaluated by a similarity measure between the sequences defined by the co-occurrence frequency and independent frequency of the word sequences. The similarity measure is an extension of Dice coefficient. An i...